Abstract

This article presents an algorithm that generates a conservative 
confidence interval of a specified length and coverage probability for 
the power of a Monte Carlo test (such as a bootstrap or permuta- 
tion test). It is the first method that achieves this aim for almost 
any Monte Carlo test. The existing research on power estimation for 
Monte Carlo tests has focused on obtaining as accurate a result as 
possible for a fixed computational effort. However, the methods pro- 
posed do not provide any guarantee of precision, in the sense that they 
cannot report a confidence interval with guaranteed coverage probabil- 
ities. In this article the computational effort is random. The algorithm 
operates until a confidence interval can be constructed that meets the 
requirements of the user, in terms of length and coverage probabil- 
ity. We show that, surprisingly, by generating two more datasets than 
what might have been assumed to be sufficient, the expected number 
of steps required by the algorithm is finite in many cases of practical
interest. These include, for instance, any situation where the
distribution of the p-value is absolutely continuous or if it is discrete
with finite support. R-code implementing the algorithm is available 
from the authors and will be integrated into an R-package available 
on CRAN. 
1 Introduction



The most common measure of the performance of a statistical test is its
power, $\beta$, defined as the probability that the null hypothesis will be
rejected if the data follow a given probability distribution, $P$. In this
context the p-value is a random variable, $p$, and the power is

\[ \beta = P[p \le \alpha], \]

where $\alpha$ is the level of the test, e.g. $\alpha = 0.05$. The power helps
choose between tests, determine the probability of detecting an effect if it
is there or simply verify that under the null hypothesis the rejection
probability is not higher than the level of the test.

This article describes a procedure to compute a conservative confidence 
interval for the power of a general Monte Carlo test, e.g., a permutation or 
bootstrap test. To our knowledge it is the first method that achieves this. 

Monte Carlo tests are tests where the p-value is estimated as the proportion
of simulated test-statistics under the null hypothesis that are as 'extreme'
as the observed test-statistic.

Example 1 (Permutation test). Suppose that we want to test whether the
mean of observations in a group of interest $\mathcal{G} = \{G_1, \dots, G_K\}$
is larger than the mean of observations in a control group
$\mathcal{C} = \{C_1, \dots, C_L\}$. Assuming that the samples in $\mathcal{G}$
and $\mathcal{C}$ are independent, a permutation test can be performed based
on the difference of the average values of the groups, i.e.
$T = \bar{\mathcal{G}} - \bar{\mathcal{C}}$. The replicate $T_j$ is formed by
randomly partitioning the pooled sample $\{G_1, \dots, G_K, C_1, \dots, C_L\}$
into two groups $\mathcal{G}_j$ and $\mathcal{C}_j$ of size $K$ and $L$
respectively and computing $T_j = \bar{\mathcal{G}}_j - \bar{\mathcal{C}}_j$.
The p-value $p = P(T_j \ge T \mid \mathcal{G}, \mathcal{C})$ is then usually
estimated by $\hat{p} = \frac{1}{M} \sum_{j=1}^{M} I[T_j \ge T]$, where $M$ is
the number of replicates.
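The estimate in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `permutation_p_value`, the toy data and the use of `random.Random` are hypothetical choices, not part of the authors' R code.

```python
import random

def permutation_p_value(group, control, n_replicates, rng):
    """Monte Carlo estimate of p = P(T_j >= T | data), T = difference in means."""
    k = len(group)
    observed = sum(group) / k - sum(control) / len(control)
    pooled = list(group) + list(control)
    hits = 0
    for _ in range(n_replicates):
        rng.shuffle(pooled)  # random partition of the pooled sample
        t_j = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if t_j >= observed:
            hits += 1
    return hits / n_replicates

# Toy data (hypothetical): the group of interest is clearly shifted upwards.
p_hat = permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4],
                            2000, random.Random(1))
```

With a fixed number of replicates the returned value is itself random, which is the source of the resampling error discussed next.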

In the above, the power could be estimated as the proportion of rejections
($\hat{p} \le \alpha$) in $N$ simulated datasets under $P$. However, with this
estimate of $p$, the probability of 'wrongly rejecting' (finding
$\hat{p} \le \alpha$ when in fact $p > \alpha$) or 'wrongly accepting' (finding
$\hat{p} > \alpha$ when $p \le \alpha$) depends on $p$. This naive approach of
performing a Monte Carlo test with a fixed number of replicates, $M$, on each
of the $N$ datasets (generating a total of $N \times M$ replicates) can
therefore lead to biased results.
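To see how large this error can be, the probability that the naive estimate falls on the wrong side of the level can be computed exactly from the Binomial distribution. The sketch below assumes $M = 100$ and $\alpha = 0.05$ purely for illustration; the function name is hypothetical.

```python
from math import comb

def naive_rejection_probability(p, m, alpha):
    """P[p_hat <= alpha] when p_hat = Binomial(m, p) / m."""
    cutoff = int(alpha * m + 1e-9)  # largest count j with j / m <= alpha
    return sum(comb(m, j) * p**j * (1 - p)**(m - j) for j in range(cutoff + 1))

# True p-value just above the level: chance of 'wrongly rejecting'.
wrong_reject = naive_rejection_probability(0.06, 100, 0.05)
# True p-value just below the level: chance of 'wrongly accepting'.
wrong_accept = 1 - naive_rejection_probability(0.04, 100, 0.05)
```

Both error probabilities are far from negligible for p-values near the level, and both change with $p$ and with $M$.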

We are estimating $P[p \le \alpha]$, the theoretical power, not
$P[\hat{p} \le \alpha]$. The second quantity depends on how the user chooses to
estimate the p-value, for example when using the naive approach it depends on
the choice of $M$, and so has less intrinsic meaning. When implementing a
Monte Carlo test, we recommend the procedure of Gandy (2009) which makes it
almost impossible to reject or accept the null hypothesis if $p > \alpha$ or
$p \le \alpha$ respectively. As a result, if this procedure is used, the
practical probability of rejecting under $P$ is virtually indistinguishable
from $\beta = P[p \le \alpha]$.

More advanced methods than the naive method for computing the power 
of a Monte Carlo test have been proposed. Oden (1991), for instance, has 
investigated how to choose the relative sizes of N (controlling the variance) 
and M (controlling the bias), to minimize the total estimation error for cer- 
tain distributions of p. Boos and Zhang (2000) partially correct the bias by 
extrapolation. 

However, the procedures that have been proposed do not provide a formal,
finite-sample guarantee on the accuracy of the estimate of $\beta$ for a
general test. This is partly because the problem has always been approached
with the principle of finding as accurate an estimate as possible for a fixed
computational effort.

In this article we approach the problem with the priorities reversed: we
wish to make exact statements about the result, in the sense of reporting a
confidence interval with conservative coverage probability for $\beta$,
allowing the computational effort to be random.

In Section 2 we describe how a confidence interval for the power can be
obtained by running $N$ Monte Carlo tests simultaneously and indefinitely
until a user-specified confidence interval length and coverage probability are
met. We demonstrate in Theorem 1 that, under very mild conditions, the
algorithm terminates in finite expected time for a (somewhat surprising)
minimum choice of $N$. Sections 3 and 4 present some additional methodology
to reduce the computational effort. The effect of these improvements is
illustrated via a simulation study in Section 5. In Section 6 we suggest
using an adaptive rule where the precision required depends on the region in
which the power is estimated to be, ensuring that the computational effort is
only high if the power estimate is in a region of interest. Finally, in
Section 7 we demonstrate the use of our algorithm on the simple permutation
test example considered in Boos and Zhang (2000). Proofs of the main results
and auxiliary lemmas are in the appendix. Within these, Lemma 4 confirms
an observation made in (Gandy, 2009, main text p. 1507 and Figure 4) about
the distance between certain stopping boundaries. R-code implementing the
algorithm is available from the authors and will be integrated into an
R-package available on CRAN.
2 The basic algorithm



When computing the theoretical power of a Monte Carlo test, like a bootstrap 
or permutation test, the following problem is implicitly present. This problem 
is the main concern of the present article. 

1. Under a probability measure $P$, there is a random variable $p$ with
support on $[0, 1]$. The distribution of $p$ is unknown and we want to
estimate $\beta = P[p \le \alpha]$ for some fixed $\alpha \in [0, 1]$.

2. Arbitrarily many independent replicates of $p$ can be generated. These
are not observable.

3. For each replicate $p_i$ of $p$, arbitrarily many independent Bernoulli
replicates $X_j^i$ can be generated with $P[X_j^i = 1] = p_i$. Only these
replicates $X_j^i$ are observable.

In the context of a Monte Carlo test, $\alpha$ is the level, $p$ is the
p-value and $\beta$ is the theoretical power.

Example 2 (Permutation test). In the permutation test mentioned in the
introduction, datasets $(\mathcal{G}^i, \mathcal{C}^i)$ are simulated from $P$.
For each dataset, the base test-statistic
$T^i = \bar{\mathcal{G}}^i - \bar{\mathcal{C}}^i$ is compared to $T_j^i$ formed
from a random partition of the pooled sample
$(G_1^i, \dots, G_K^i, C_1^i, \dots, C_L^i)$. Here, $X_j^i = I[T_j^i \ge T^i]$
is Bernoulli with success probability
$p_i = P(T_j^i \ge T^i \mid \mathcal{G}^i, \mathcal{C}^i)$, the actual frequency
that $T_j^i \ge T^i$ over all partitions of the $i$th pooled sample. Typically,
this is not feasible to compute, making $p_i$ not observable.



2.1 Determining if a value is less than a threshold 

Our approach will need to determine with low error probability whether
$p_i \le \alpha$. For this, we use the sequential procedure of Gandy (2009),
which we now describe briefly.

Let $(X_j : j \in \mathbb{N})$ denote a sequence of independent and
identically distributed Bernoulli random variables with unknown success
probability $p$. Given a pre-specified error probability $\epsilon > 0$, the
procedure reports an estimate $\hat{p}$ of $p$ such that for all
$p \in [0, 1]$,

\[ P_p[I(\hat{p} \le \alpha) \ne I(p \le \alpha)] \le \epsilon, \tag{1} \]
i.e. the probability of $\hat{p}$ and $p$ being on different sides of $\alpha$
is bounded by $\epsilon$. The procedure repeatedly updates the partial sum
$S_t = \sum_{j=1}^{t} X_j$ with a new realisation $X_t$. It terminates at a
time $\tau$ when $S_t$ hits either an upper barrier $(U_t : t \in \mathbb{N})$
or a lower barrier $(L_t : t \in \mathbb{N})$, i.e.

\[ \tau = \min\{t : S_t \ge U_t \text{ or } S_t \le L_t\}. \]

An example of the boundaries $U_t$ and $L_t$ with $\epsilon = 0.01$ and
$\alpha = 0.05$ is depicted in Figure 1. More precisely, $U_t$ and $L_t$ are
two integer sequences that are recursively defined based on the case
$p = \alpha$ via

\[ U_t = \min\{j \in \mathbb{N} : P_\alpha(S_t \ge j, \tau \ge t) + P_\alpha(S_\tau \ge U_\tau, \tau < t) \le \epsilon_t\}, \]

\[ L_t = \max\{j \in \mathbb{Z} : P_\alpha(S_t \le j, \tau \ge t) + P_\alpha(S_\tau \le L_\tau, \tau < t) \le \epsilon_t\}, \tag{2} \]

where $\epsilon_t$ is a spending sequence with $\epsilon_t \to \epsilon$ as
$t \to \infty$. The maximum likelihood estimator $\hat{p} = S_\tau / \tau$
satisfies (1), see Gandy (2009, Theorem 2).
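The recursion in (2) can be sketched by tracking, under $p = \alpha$, the distribution of the partial sum over paths that have not yet stopped. The code below is a minimal reading of (2) with the spending sequence $\epsilon_t = \epsilon t/(k + t)$; it is not the implementation in the simctest package, and the function name, the return format and the default `k = 1000` are assumptions made for illustration.

```python
def stopping_boundaries(alpha, eps, horizon, k=1000):
    """Return (upper, lower, up_hit, low_hit) for t = 1..horizon.

    upper[t-1] = U_t and lower[t-1] = L_t; up_hit and low_hit are the
    probabilities, under p = alpha, of having hit each boundary by `horizon`.
    """
    dist = {0: 1.0}  # P_alpha(S_t = s, tau > t), here at t = 0
    up_hit = low_hit = 0.0
    upper, lower = [], []
    for t in range(1, horizon + 1):
        eps_t = eps * t / (k + t)  # spending sequence
        step = {}  # P_alpha(S_t = s, tau >= t)
        for s, pr in dist.items():
            step[s] = step.get(s, 0.0) + pr * (1 - alpha)
            step[s + 1] = step.get(s + 1, 0.0) + pr * alpha
        # U_t: smallest j with P(S_t >= j, tau >= t) + up_hit <= eps_t
        u, tail = t + 1, 0.0
        for j in range(t, -1, -1):
            cand = tail + step.get(j, 0.0)
            if cand + up_hit > eps_t:
                break
            u, tail = j, cand
        # L_t: largest j with P(S_t <= j, tau >= t) + low_hit <= eps_t
        l, head = -1, 0.0
        for j in range(0, t + 1):
            cand = head + step.get(j, 0.0)
            if cand + low_hit > eps_t:
                break
            l, head = j, cand
        up_hit += tail
        low_hit += head
        # Keep only the paths strictly between the two boundaries.
        dist = {s: pr for s, pr in step.items() if l < s < u}
        upper.append(u)
        lower.append(l)
    return upper, lower, up_hit, low_hit

upper, lower, up_hit, low_hit = stopping_boundaries(0.05, 0.01, 300)
```

At early times neither boundary can be reached ($U_t = t + 1$, $L_t = -1$); the lower boundary only becomes reachable once the probability of seeing no successes drops below the spent error.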



2.2 The proposed algorithm 

We refer to each of the Bernoulli sequences as a stream. In our proposed
algorithm the sequential procedure of Gandy (2009) is applied to $N$
independent streams simultaneously. We say that a stream stops with a positive
outcome if the lower boundary is hit (the test on this dataset was
significant, the null hypothesis was rejected, $p_i$ is reported to be smaller
or equal to $\alpha$) and a negative outcome if the upper boundary is hit.
While neither boundary is hit, the stream is unresolved.

The algorithm terminates when enough streams have been resolved to
compute a confidence interval (CI) for $\beta$ with a given coverage
probability $1 - \gamma$ and a length not larger than a pre-specified value
$\Delta$. The following is the basic algorithm that we are proposing.

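Algorithm 1 (Basic algorithm).
for $i = 1, \dots, N$
    Initialise stream $i$
    Let $S_0^i = 0$
Let $t = 0$; $R_0 = 0$; $A_0 = 0$; $\mathcal{U}_0 = \{1, \dots, N\}$
while $|I(R_t, A_t, |\mathcal{U}_t|; \gamma)| > \Delta$
    Let $t = t + 1$, $R_t = R_{t-1}$, $A_t = A_{t-1}$, $\mathcal{U}_t = \mathcal{U}_{t-1}$
    for $i \in \mathcal{U}_t$
        Generate $X_t^i$
        Let $S_t^i = S_{t-1}^i + X_t^i$
        If $S_t^i \ge U_t$ let $A_t = A_t + 1$, $\mathcal{U}_t = \mathcal{U}_t \setminus \{i\}$
        If $S_t^i \le L_t$ let $R_t = R_t + 1$, $\mathcal{U}_t = \mathcal{U}_t \setminus \{i\}$
Report $I(R_t, A_t, |\mathcal{U}_t|; \gamma)$ as confidence interval for $\beta$.

Figure 1: Confidence intervals generated by the algorithm using $N = 4$,
$\epsilon = 0.01$, $\alpha = 0.05$, $\epsilon_t = \epsilon t/(1000 + t)$ and
$\gamma = 0.05$.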

$\mathcal{U}_t$ is a set containing the indices of the streams that have not
stopped by time $t$, i.e. have not hit either of the boundaries.
$|\mathcal{U}_t|$ denotes the number of elements in $\mathcal{U}_t$. $R_t$ and
$A_t$ count respectively the number of positive outcomes (rejections) and
negative outcomes (acceptances). $I(R_t, A_t, |\mathcal{U}_t|; \gamma)$, to be
defined in the next subsection, denotes a confidence interval for $\beta$
based on $R_t$, $A_t$ and $|\mathcal{U}_t|$. Its length is denoted by
$|I(R_t, A_t, |\mathcal{U}_t|; \gamma)|$. The interval will shrink as further
streams are resolved until, assuming $N$ is large enough, the desired length
is reached.

Figure 1 illustrates the algorithm in a toy example with only $N = 4$
streams. The thin lines depict the 4 corresponding partial sum sequences,
$S_t^i$. When $S_t^i$ hits one of the boundaries the stream is stopped, and
the p-value is reported to be either larger than $\alpha$ (if $U_t$ is hit) or
smaller or equal to $\alpha$ (if $L_t$ is hit) with error probability less
than $\epsilon$. The CI for $\beta$ (annotated at the top of the graph)
shrinks every time a stream stops. In Subsection 2.3 we describe how this
interval is computed.
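The control flow of Algorithm 1 can be sketched as follows. The boundaries and the confidence interval are passed in as functions so that the loop stands on its own; the names `draw`, `upper`, `lower`, `interval` and the safety cap `max_steps` are hypothetical and not part of the algorithm as stated.

```python
def basic_algorithm(draw, n_streams, upper, lower, interval, delta, max_steps):
    """Run streams in parallel until the CI is no longer than delta.

    draw(i)           -> one Bernoulli replicate for stream i
    upper(t), lower(t) -> boundaries U_t and L_t
    interval(r, a, m) -> (lo, hi) given rejections, acceptances, unresolved
    max_steps is an illustrative safety cap, not part of Algorithm 1.
    """
    sums = [0] * n_streams
    unresolved = set(range(n_streams))
    rejections = acceptances = 0
    t = 0
    lo, hi = interval(rejections, acceptances, len(unresolved))
    while hi - lo > delta and t < max_steps:
        t += 1
        for i in sorted(unresolved):  # iterate over a copy; the set changes below
            sums[i] += draw(i)
            if sums[i] >= upper(t):    # negative outcome (acceptance)
                acceptances += 1
                unresolved.discard(i)
            elif sums[i] <= lower(t):  # positive outcome (rejection)
                rejections += 1
                unresolved.discard(i)
        lo, hi = interval(rejections, acceptances, len(unresolved))
    return (lo, hi), rejections, acceptances, len(unresolved), t
```

A caller would supply the boundaries of Section 2.1 and the interval of Section 2.3; any stand-ins with the same signatures can be used to exercise the loop.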
2.3 The Confidence interval



Suppose we have $N$ streams and observe all of their outcomes. Because of
the "uniformly bounded resampling risk" (Gandy, 2009), the outcome of one
resolved stream, $I[\hat{p} \le \alpha]$, is Bernoulli with success probability
in the interval $[(1 - \epsilon)\beta, (1 - \epsilon)\beta + \epsilon]$. Using
this, it can be seen that the following interval $I_\infty$ is a conservative
confidence interval for $\beta$ with coverage probability $1 - \gamma$:

\[ I_\infty(R_\infty, A_\infty; \gamma) = \left[ \frac{\beta^*_- - \epsilon}{1 - \epsilon}, \; \frac{\beta^*_+}{1 - \epsilon} \right], \]

where $R_\infty$ ($A_\infty$) denotes the number of positive (negative)
outcomes and $\beta^*_-$ and $\beta^*_+$ are such that for Binomial random
variables $B^-$ and $B^+$ with size $A_\infty + R_\infty$ and respective
success probabilities $\beta^*_-$ and $\beta^*_+$ we have

\[ P[B^- \ge R_\infty] = \gamma/2 = P[B^+ \le R_\infty]. \]

The subscript in $I_\infty$ represents that this is the interval that would be
obtained by our algorithm if it were allowed to run indefinitely.

We need to extend this to a situation with unresolved streams, whilst
keeping a conservative confidence interval. We do this by taking the union
of all intervals we could get from all possible outcomes of the unresolved
streams. This guarantees the coverage probability no matter how the $i$th
stream being resolved by time $t$ depends on $p_i$.

To be precise, the confidence interval at time $t$ in our algorithm is
obtained by letting

\[ I_t = I(R_t, A_t, |\mathcal{U}_t|; \gamma), \]

where

\[ I(r, a, m; \gamma) = \bigcup_{r_\infty = r}^{r + m} I_\infty(r_\infty, r + a + m - r_\infty; \gamma). \tag{3} \]

Evidently, by construction, $I_1 \supseteq I_2 \supseteq \dots \supseteq I_\infty$ and

\[ P[\beta \in I_1 \cap \dots \cap \beta \in I_t \cap \dots \cap \beta \in I_\infty] \ge 1 - \gamma. \]
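A minimal sketch of these intervals follows. It assumes that $I_\infty$ is a Clopper-Pearson interval for the success probability, shifted and rescaled by $\epsilon$ as implied by the bounds $[(1 - \epsilon)\beta, (1 - \epsilon)\beta + \epsilon]$, and it clips the result to $[0, 1]$; the clipping, the bisection solver and all function names are assumptions for illustration. The direct Binomial sum is only suitable for small numbers of streams.

```python
from math import comb

def binom_cdf(x, n, p):
    """P[B <= x] for B ~ Binomial(n, p); direct sum, small n only."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(x + 1))

def solve(f, target, increasing):
    """Bisection for f(p) = target on [0, 1], with f monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def interval_inf(r, a, gamma, eps):
    """I_infinity(r, a; gamma) when all r + a streams are resolved."""
    n = r + a
    # beta*_-: P[B >= r] = gamma / 2 (taken as 0 when r = 0)
    lower = 0.0 if r == 0 else solve(
        lambda p: 1 - binom_cdf(r - 1, n, p), gamma / 2, True)
    # beta*_+: P[B <= r] = gamma / 2 (taken as 1 when r = n)
    upper = 1.0 if r == n else solve(
        lambda p: binom_cdf(r, n, p), gamma / 2, False)
    return max(0.0, (lower - eps) / (1 - eps)), min(1.0, upper / (1 - eps))

def interval(r, a, m, gamma, eps):
    """I(r, a, m; gamma): hull over all outcomes of the m unresolved streams."""
    pieces = [interval_inf(r_inf, r + a + m - r_inf, gamma, eps)
              for r_inf in range(r, r + m + 1)]
    return min(lo for lo, _ in pieces), max(hi for _, hi in pieces)
```

Consecutive pieces overlap, so taking the smallest lower end and the largest upper end gives the union. With no resolved streams the result is the whole of $[0, 1]$, and each resolved stream can only shorten it.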



2.4 Expected time 

A simpler algorithm than Algorithm 1 would be the following: start $N$
independent streams and wait until all have been resolved. The number of
streams $N$ can be chosen such that, no matter what the outcomes are, the
confidence interval length will be shorter than $\Delta$. However, this
algorithm is unusable in practice as it requires an infinite expected effort.
Indeed, Gandy (2009, p. 1506) shows that if the cumulative distribution
function (CDF) of $p$ has a non-zero derivative at $\alpha$, which is a very
common case, then $E[\tau_i] = \infty$, where $\tau_i$ denotes the
stopping-time of stream $i$. Thus the overall expected effort
$E[e] = E[\sum_i \tau_i] = \infty$.

We next show that with our algorithm we can choose $N$ and $\epsilon_t$ such
that the expected effort is finite. The key is to make $N$ large enough such
that not all streams have to be resolved.

In Algorithm 1 the effort is

\[ e = \sum_{i=1}^{N} \min\{\tau_i, \tau_{(N - \kappa)}\}, \tag{4} \]

where $\tau_i$ denotes the stopping-time of stream $i$, $\kappa$ is the number
of streams that are unresolved when the algorithm finishes and
$\tau_{(1)} \le \dots \le \tau_{(N)}$ denote the order statistics of
$\tau_1, \dots, \tau_N$.

For any $k \ge 1$, by choosing $N$ large enough and $\epsilon$ small enough,
we can ensure that $\kappa$ is at least $k$. The effort is then bounded above
by $\tau_{(N-k)} N$. Thus to ensure that $E[e]$ is finite, it suffices to prove
that $E[\tau_{(N-k)}] < \infty$ for some $k$. The following theorem shows that
in many cases $k$ can be taken as small as 2.

Theorem 1. Suppose that $\epsilon \le 1/4$ and there exist constants
$\lambda > 0$, $q > 1$ and $T \in \mathbb{N}$ such that
$\epsilon_t - \epsilon_{t-1} \ge \lambda t^{-q}$ for all $t \ge T$. Further,
suppose that in a neighbourhood of $\alpha$ the CDF of $p$ is Hölder
continuous with exponent $\xi$. Then $E[\tau_{(i)}] < \infty$ for
$i \le N - \lfloor 2/\xi \rfloor$. In particular, if $\xi = 1$ (the CDF is
Lipschitz continuous in a neighbourhood of $\alpha$) then
$E[\tau_{(N-2)}] < \infty$.

A function $F$ is Hölder continuous with exponent $\xi$ in a neighbourhood
of $\alpha$ if there exists an open interval $U$ containing $\alpha$ for which
there exists a $c > 0$ such that for all $x, y \in U$,
$|F(x) - F(y)| \le c|x - y|^\xi$.

The first set of conditions of Theorem 1 can be satisfied in any situation
by an appropriate choice of the spending sequence and $\epsilon$. In fact, in
the R-package simctest corresponding to the article Gandy (2009), the default
spending sequence $\epsilon_t = \epsilon t/(1000 + t)$ satisfies the conditions
with $\lambda = 1$ and $q = 2$, assuming one chooses $\epsilon \le 1/4$.
Whether the second set of conditions can be satisfied depends on the
testing problem, although in many examples $\xi = 1$, for instance whenever
the distribution of $p$ is absolutely continuous and has a bounded density in
the neighbourhood of $\alpha$. If the distribution of $p$ is discrete and has
finite support, then $\xi = 1$ if there is no probability mass at $\alpha$.
Here, even if there is mass at $\alpha$, if the support of $p$ is finite (e.g.
in a permutation test), it is in principle possible to find
$\alpha' > \alpha$ such that

\[ \beta = P[p \le \alpha] = P[p \le \alpha'], \quad P[p = \alpha'] = 0. \]

The entire algorithm can then be applied to $\alpha'$ instead of $\alpha$, and
the situation is effectively one where $\xi = 1$.

Henceforward the conditions of Theorem 1 are assumed to be satisfied
with $\xi = 1$. The algorithm will meet the user-specified precision
requirements with a finite expected effort if it will terminate by time
$\tau_{(N-2)}$ with probability one, or if
$P[|I_{\tau_{(N-2)}}| > \Delta] = 0$. As can be verified, with $N - 2$ of $N$
streams resolved the largest possible CI length occurs when there are
$\lfloor (N-2)/2 \rfloor$ positive outcomes. $N$ must therefore satisfy
$|I(\lfloor (N-2)/2 \rfloor, \lceil (N-2)/2 \rceil, 2; \gamma)| \le \Delta$.
We shall call the minimal such $N$ the blind minimal $N$, $N_B$.



3 Choosing the number of streams 

The algorithm so far gives the desired guaranteed performance; however,
as Section 5 shows, the computational effort can be large. So far the
algorithm depended on the user specifying $N$ subject to $N \ge N_B$. In
Section 3.1 we introduce a pilot sample to reduce the minimal $N$. In Section
3.2 we suggest a method to approximate the effort as a function of $N$, using
the information obtained in the pilot. This allows a choice of $N$ that is
adapted to the unknown p-value distribution.



3.1 Reducing the simple minimum N 

Suppose that we observe a Binomial variable $B$ with size $m$. The length
of the confidence interval is roughly proportional to
$\sqrt{\hat{\pi}(1 - \hat{\pi})/m}$, where $\hat{\pi} = B/m$. As a result, the
length can be considerably larger when $B$ is close to $m/2$ than at the
extremes $B = 0$ and $B = m$. With our confidence interval, we similarly have
a much larger CI length if of $N - 2$ resolved streams the number of positive
outcomes is close to $\lfloor (N-2)/2 \rfloor$, as opposed to $0$ or $N - 2$.

The blind minimal $N_B$ ensures that for any outcome from $N - 2$ of $N$
streams the length of the confidence interval is at most $\Delta$. However, if
the true power is close to zero or one, the probability that the number of
positive outcomes at time $\tau_{(N-2)}$ will be close to $(N-2)/2$ will tend
to be minuscule. In this case, it seems rather wasteful to choose an $N$ as
large as $N_B$ to safeguard against an unlikely scenario.

To reduce the minimal $N$, we propose to first obtain a pilot sample, where
$n$ streams are run and stopped at a maximum number of steps $t_{max}$,
obtaining a preliminary confidence interval
$I_P = I(R_P, A_P, |\mathcal{U}_P|; \gamma_P)$, where $I$ is defined in (3),
$\gamma_P$ is some pre-specified value (substantially) less than $\gamma$ and
$R_P$, $A_P$, $|\mathcal{U}_P|$ are the number of positive outcomes, negative
outcomes and unresolved streams.

In the main run the following interval can then be reported:

\[ I_t^P = I(R_t, A_t, |\mathcal{U}_t|; \gamma - \gamma_P) \cap I_P. \tag{5} \]

This respects the minimum coverage probability $1 - \gamma$, since a
Bonferroni correction was used. We call the minimal $N$ such that for all
$r \in \{0, 1, \dots, N - 2\}$:

\[ |I(r, N - 2 - r, 2; \gamma - \gamma_P) \cap I_P| \le \Delta \]

the pilot-based minimal $N$, denoted by $N_P$. Given the pilot it can be
determined by a computational search.

The intersection can allow us to choose a smaller $N$ than $N_B$. Indeed,
after $N - 2$ of $N$ streams in the main run are resolved, the maximum CI
length achievable is for a number of positive outcomes $r$ that satisfies
$r/(N-2) \in I_P$. As demonstrated for pilot intervals $I_P$ to the left of
0.5 in Figure 2, the minimum number of streams that are needed in the main run
can be reduced substantially, in particular if $I_P$ lies far to the left (or
right) of 0.5.

Heuristically, the disadvantage of a small increase in the coverage
probability from $1 - \gamma$ to $1 - (\gamma - \gamma_P)$ can be outweighed by
being able to exclude large intervals centered around 0.5.

3.2 Approximation of the optimal number of streams 

In the previous section the range of possible $N$s was extended from
$N \ge N_B$ to $N \ge N_P$. We now describe how to choose $N$ within this
range in order to minimize $E(e)$, where $e$ is defined in (4). This is
achieved by predicting $E(e)$ for any given $N$ based on information from the
pilot sample.

Figure 2: Ratio of the pilot-based minimum $N$, $N_P$, over the blind version,
$N_B$, as a function of the rightmost point $\max I_P$ of the pilot sample,
with $\Delta = 0.01$, $\epsilon = 0.0001$, $\gamma = 0.01$,
$\gamma_P = \gamma/10$. Here, $N_B = 68311$.

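Removing the conditioning on $p_i$, the effort of the $i$th stream,
$\min\{\tau_i, \tau_{(N-\kappa)}\}$, is a replicate of a random variable
$\sigma$, say. We have

\[ E[\sigma] = P[\sigma \le t_{max}] E[\sigma \mid \sigma \le t_{max}] + P[\sigma > t_{max}] E[\sigma \mid \sigma > t_{max}]. \]

Based on this, and temporarily ignoring the possibility that
$\tau_{(N-\kappa)} < t_{max}$, we approximate $E(e)$ as

\[ E(e) \approx N[\hat{\pi}_0 \hat{\sigma}_0 + \hat{\pi}_1 \hat{\sigma}_1], \]

where $\hat{\pi}_0$ is the proportion of streams in the pilot sample that
stopped before or at $t_{max}$, $\hat{\sigma}_0$ is the average stopping-time
of those streams, $\hat{\pi}_1 = 1 - \hat{\pi}_0$ is the proportion of streams
that were unresolved at $t_{max}$ and finally $\hat{\sigma}_1$ is an estimate
of $E[\sigma \mid \sigma > t_{max}]$.

Since $E[\sigma \mid \sigma > t_{max}] = \int_0^\infty S(t)\,dt$, where $S$ is
the survival function of $\sigma \mid \sigma > t_{max}$, we will use
$\hat{\sigma}_1 = \int_0^\infty \hat{S}(t)\,dt$, where $\hat{S}$ is the
following approximation for $S$:

\[ \hat{S}(t) = \begin{cases} 1, & t \le t_{max}, \\ c\sqrt{\log(t)/t}, & t > t_{max} \text{ and } t \le \hat{\tau}_{(N-\kappa)}, \\ 0, & \text{otherwise}, \end{cases} \]

where $c = \sqrt{t_{max}/\log t_{max}}$ and $\hat{\tau}_{(N-\kappa)}$ is an
estimate of $\tau_{(N-\kappa)}$, described below.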



$\hat{S}(t)$ is the survival function that would occur if $\tau_{(N-\kappa)}$
was known to be equal to $\hat{\tau}_{(N-\kappa)}$ and if the conditional
survival function of $\tau_i$ given $\tau_i > t_{max}$, without any truncation
by the algorithm, was $P[\tau_i > t \mid \tau_i > t_{max}] = c\sqrt{\log(t)/t}$.
This latter approximation appears to be appropriate for spending sequences
that satisfy the conditions of Theorem 1, p-value distributions that are
'smooth' around $\alpha$, and a large enough $t_{max}$. We tested the
approximation thoroughly on four distributions for $p$, Beta distributions
Beta$(1, x)$ with $x$ chosen such that $P[p \le 0.05] = 0.05, 0.7, 0.9$ and
$0.99$. From 50000 streams generated for each distribution, with termination
time larger than $t_{max} = 1000$ (obtained by discarding those that
terminated before), the average ratios of the approximated conditional
survival distribution over the
empirical version over [1000, 10^] (evaluated at the observed stopping-times) 
were respectively 0.93, 0.93, 0.85 and 0.67. This bounds the error due to this 
approximation. 
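The effort prediction can be sketched numerically. The code assumes that the approximate survival function equals 1 up to $t_{max}$, follows $c\sqrt{\log(t)/t}$ with $c = \sqrt{t_{max}/\log t_{max}}$ up to the estimated stopping time, and is 0 afterwards; it integrates with Simpson's rule. The function names and the number of integration steps are hypothetical choices.

```python
from math import log, sqrt

def sigma1_hat(t_max, tau_hat, steps=2000):
    """Integral over [0, inf) of the approximate survival function.

    Assumes S_hat = 1 on [0, t_max], c*sqrt(log(t)/t) on (t_max, tau_hat]
    and 0 afterwards. `steps` must be even (Simpson's rule).
    """
    if tau_hat < t_max:
        raise ValueError("this sketch only covers tau_hat >= t_max")
    c = sqrt(t_max / log(t_max))
    f = lambda t: c * sqrt(log(t) / t)
    h = (tau_hat - t_max) / steps
    total = f(t_max) + f(tau_hat)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(t_max + i * h)
    return t_max + total * h / 3

def approx_effort(n, pi0, sigma0, t_max, tau_hat):
    """N * [pi0 * sigma0 + pi1 * sigma1], with pi1 = 1 - pi0."""
    return n * (pi0 * sigma0 + (1 - pi0) * sigma1_hat(t_max, tau_hat))
```

For instance, `approx_effort(20000, 0.9, 150.0, 1000, 50000)` would give the predicted effort for a hypothetical pilot in which 90% of streams stopped by $t_{max} = 1000$ after 150 steps on average.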

Finally, $\hat{\tau}_{(N-\kappa)}$ is obtained as follows. We estimate
$\kappa$ via

\[ \hat{\kappa} = \max\{k \in \{2, \dots, N\} : |I(\lfloor \hat{\beta}_P (N - k) \rfloor, \lceil (1 - \hat{\beta}_P)(N - k) \rceil, k)| \le \Delta\}, \]

where $\hat{\beta}_P$ is an estimate of $\beta$ based on the outcomes of the
streams that stopped during the pilot and the position relative to
$\alpha t_{max}$ of the streams that were unresolved. (The predictor of
$\kappa$ above assumes that the proportion of positive outcomes will be
exactly $\hat{\beta}_P$.) We also predict that
$N_1 = \lfloor \hat{\pi}_1 N \rfloor$ streams will stop after $t_{max}$. On
that basis we set $\hat{\tau}_{(N-\kappa)}$ to be the solution for $t$ of
$c\sqrt{\log(t)/t} = (N_1 - \hat{\kappa})/N_1$. Special cases, e.g. where
$\hat{\tau}_{(N-\kappa)} < t_{max}$, are handled in the natural way.
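An equation of the form $c\sqrt{\log(t)/t} = s$ has no closed-form solution but is easy to solve by bisection, because the left-hand side is decreasing in $t$ for $t > e$. The sketch below solves it for a generic right-hand side $s \in (0, 1]$; the function name and the bracketing strategy are assumptions.

```python
from math import log, sqrt

def solve_stopping_time(t_max, s):
    """Solve c*sqrt(log(t)/t) = s for t >= t_max, c = sqrt(t_max/log(t_max)).

    Assumes t_max > e, so that the left-hand side decreases in t.
    """
    if not 0 < s <= 1:
        raise ValueError("s must lie in (0, 1]")
    c = sqrt(t_max / log(t_max))
    surv = lambda t: c * sqrt(log(t) / t)
    lo = hi = float(t_max)
    while surv(hi) > s:  # expand until the root is bracketed
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if surv(mid) > s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t_half = solve_stopping_time(1000, 0.5)
```

Here `t_half` is the time by which, under the approximation, half of the streams still running at $t_{max} = 1000$ would have stopped.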

We denote by $N_O$ the $N$ that minimizes $E(e)$ subject to $N \ge N_P$. In
practice it is found by a brute force search over a sensible range
$N_P \le N \le N_{max}$. Compared to the main loop of the algorithm, the
computation time for this search tends to be negligible.

Figure 3 illustrates the performance of our approximation in an example
where $\alpha = 0.05$, $1 - \gamma = 0.99$, $\Delta = 0.02$,
$\epsilon = 0.0001$, $\epsilon_t = \epsilon t/(1000 + t)$, and $p$ follows a
Beta$(1, x)$ distribution with power 0.7.

Based on a pre-simulated sample of 10^ tuples (stopping-time, outcome) 
we obtained an estimate of the minimum possible expected effort subject to 
$N \ge N_P$ by resampling from the tuples and emulating the operation of the
algorithm 100 times for each of a range of choices for $N$. The observed effort
for each attempted $N$ is shown in the black squares of the figure.

Independently we generated 100 pilots with $n = 1000$ and $t_{max} = 1000$
and thus obtained 100 estimated effort-functions of $N$, which are displayed
in the transparent lines. Of the 100 replicates of $N_O$ thus obtained, the
maximum ratio of the expected effort for $N_O$ over the minimum possible
effort (both calculated via the resampling routine described above) is 1.03.

Figure 3: Approximation of the expected effort for each $N$ (horizontal axis:
$N$). See details in main text.

In general, the procedure we have proposed tends to slightly overestimate
the effort. In this example, the ratio of the approximated effort-function
over the truth is on average 1.3. However, in this example and in general we
have found that the minimum of the true curve and estimated curve are for
roughly the same $N$ and also that the optimum is quite flat; it is mostly
just important to avoid the regions of very high expected effort that can
occur close to $N_P$.

4 Stopping based on joint information 

In this section we describe a testing procedure that allows the algorithm to
stop with more unresolved streams. The procedure analyses the current set
of unresolved streams as a whole and reports a lower bound $r_t$ ($a_t$) on
the number of p-values from the remaining streams that are less or equal to
$\alpha$ (greater than $\alpha$), if both of the following hypotheses are
rejected:

\[ H_0^+ : \sum_{i=1}^{|\mathcal{U}_t|} I[p_i > \alpha] \ge |\mathcal{U}_t| - r_t + 1, \qquad H_0^- : \sum_{i=1}^{|\mathcal{U}_t|} I[p_i \le \alpha] \ge |\mathcal{U}_t| - a_t + 1, \]

where $r_t, a_t \ge 0$ and $r_t + a_t \le |\mathcal{U}_t|$, and the indices of
the remaining unresolved streams are assumed to be
$\{1, \dots, |\mathcal{U}_t|\}$. We will discuss the choice of $r_t$ and $a_t$
later. The hypotheses will be rejected for large values of the test-statistics,

\[ T^+ = \sum_{i=r_t}^{|\mathcal{U}_t|} \ldots, \qquad T^- = \sum_{i=1}^{|\mathcal{U}_t| - a_t + 1} \ldots, \]

where $S_t^{(1)} \le \dots \le S_t^{(|\mathcal{U}_t|)}$ are the ordered partial
sums corresponding to the remaining streams, $\eta$ is a chosen (small)
positive value, and

\[ F_t^\alpha(x) = P_\alpha[S_t \le x \mid \tau > t], \]

i.e. $F_t^\alpha$ is the CDF of a cumulative sum of $t$ Bernoulli variables
with success probability $\alpha$, conditional on not having hit either
boundary by time $t$. This function can be computed recursively.

The random variable $X$ is said to be smaller than the random variable $Y$
with respect to the usual stochastic order, denoted $X \le_{st} Y$, if for all
$x \in \mathbb{R}$, $F_X(x) \ge F_Y(x)$, where $F_X$ and $F_Y$ are respectively
the CDFs of $X$ and $Y$. In the appendix we prove the following:

Theorem 2. Under $H_0^+$, $T^+ \le_{st} B^+$ and under $H_0^-$,
$T^- \le_{st} B^-$, where $B^+$ and $B^-$ are Binomial variables with success
probability $\eta$ and size $|\mathcal{U}_t| - r_t + 1$ and
$|\mathcal{U}_t| - a_t + 1$ respectively.

$H_0^+$ and $H_0^-$ can therefore be rejected conservatively when $T^+$ and
$T^-$ are significantly large for the corresponding Binomial variables.
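In practice the conservative rejection rule amounts to comparing an upper Binomial tail probability with the test level. The sketch below shows this for one hypothesis; the function names are hypothetical and the direct sum is only suitable for moderate sizes.

```python
from math import comb

def binomial_upper_tail(t_obs, size, eta):
    """P[B >= t_obs] for B ~ Binomial(size, eta)."""
    return sum(comb(size, j) * eta**j * (1 - eta)**(size - j)
               for j in range(t_obs, size + 1))

def rejects(t_obs, size, eta, level):
    """Reject when the observed statistic is too large for Binomial(size, eta).

    Because the statistic is stochastically smaller than the Binomial
    variable under the null, this rule has error probability at most `level`.
    """
    return binomial_upper_tail(t_obs, size, eta) <= level
```

For example, with 100 remaining streams, $r_t = 1$ and $\eta = 0.05$, the relevant Binomial variable has size 100, and an observed statistic of 20 would be far in its upper tail.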

Using Bonferroni correction, a minimum coverage probability of $1 - \gamma$
is guaranteed if for all $t$ we compute a confidence interval

\[ I_t^J = I(\tilde{R}_t, \tilde{A}_t, |\tilde{\mathcal{U}}_t|; \gamma - \gamma_P - \gamma_J) \cap I_P, \]

where $(\tilde{R}_t, \tilde{A}_t, |\tilde{\mathcal{U}}_t|) = (R_t + r_t, A_t + a_t, |\mathcal{U}_t| - r_t - a_t)$
if the test rejects, $(R_t, A_t, |\mathcal{U}_t|)$ otherwise, and
$\gamma_J < \gamma - \gamma_P$ is an upper bound on the overall probability of
wrongly rejecting either hypothesis at any point in time. To guarantee this
bound we set a sequence of rejection thresholds $\xi_1, \xi_2, \dots$ such
that

\[ \sum_{i=1}^{\infty} \xi_i = \gamma_J, \quad \xi_i \ge 0. \]

Thus, at time $t$ each hypothesis is tested at a level determined by $\xi_t$.

The ultimate objective is to stop the algorithm early. On this basis we
choose $r_t$ and $a_t$ such that $|I_t^J| \le \Delta$ if the test rejects, in
which case the algorithm can stop immediately.

The procedure is mostly useful when the number of resolutions required,
$r_t + a_t$, is small compared to the number of remaining streams
$|\mathcal{U}_t|$. As an extreme example, suppose that $r_t = 1$, $a_t = 0$ and
$|\mathcal{U}_t| = 100$. In this case, it can be possible to conclude with
virtual certainty that at least 1 of the 100 streams has a p-value less than
$\alpha$, when concluding the same about any individual stream could require
many more samples.

In this procedure there are a number of free parameters that we set
somewhat heuristically. From a small simulation study we established that
choosing $\eta = 0.05$ gave good results. As for $r_t$ and $a_t$, they are
chosen to be equal and then as small as possible subject to the algorithm
terminating if the hypotheses can be rejected, since for simple p-value
distributions it is likely that the unresolved p-values would be roughly
evenly distributed around $\alpha$.

In the simulation studies that follow and in the R-implementation,
$\gamma_J = \gamma/10$, and $\xi_t$ is only positive when
$t = t_i = 2i \times 10^{\ldots}$ for $i \in \mathbb{N}$, with thresholds of
the form $\gamma_J \times 20/(20 + \ldots$

5 Simulations 

This simulation study illustrates the effort required by our algorithm and the
effect of the improvements suggested in Sections 3 and 4. For all experiments
we set $\alpha = 0.05$, $\Delta = 0.02$, $1 - \gamma = 0.99$,
$\epsilon = 0.0001$ and $\epsilon_t = \epsilon t/(1000 + t)$. Four p-value
distributions were considered, Beta$(1, x)$ with $x$ chosen such that
$P[p \le \alpha] = \alpha, 0.7, 0.9, 0.99$, i.e. $x = 1$ (a Uniform
distribution) and roughly 23.5, 44.9 and 89.8 respectively in the next three
cases. The quantity of interest is the average total number of samples
generated, previously referred to as the effort.
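The quoted shape parameters follow from the CDF of a Beta$(1, x)$ variable, $P[p \le \alpha] = 1 - (1 - \alpha)^x$, which can be inverted in closed form. A short check (the function name is hypothetical):

```python
from math import log

def beta_shape_for_power(beta, alpha=0.05):
    """x such that P[p <= alpha] = beta when p ~ Beta(1, x)."""
    return log(1 - beta) / log(1 - alpha)

shapes = [beta_shape_for_power(b) for b in (0.05, 0.7, 0.9, 0.99)]
```

Rounded to one decimal place, `shapes` agrees with the values 1, 23.5, 44.9 and 89.8 stated above.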

We report the average effort based on 100 replicated runs. These are
displayed in the left subcolumn for each distribution in Table 1. In the
right subcolumn we report the standard error of the corresponding estimate
based on the usual Gaussian approximation, i.e. the standard deviation of
the sample divided by $\sqrt{100}$.

             beta = 0.05    beta = 0.7     beta = 0.9     beta = 0.99
             Av. (S.E.)     Av. (S.E.)     Av. (S.E.)     Av. (S.E.)
Optimal      12.3 (0.14)    3329 (35)      539 (8.4)      16.2 (0.08)
Min.         12.5 (0.16)    8498 (296)     548 (9.2)      16.1 (0.08)
No test      10.5 (0.22)    3324 (41)      568 (7.9)      10.4 (0.10)
With test     8.0 (0.19)    1541 (18)      317 (5.2)      10.4 (0.09)

Table 1: Average effort (in millions) of our adaptive methods ("No test" and
"With test") compared with the minimum $N$ and the optimal $N$.

In the first two rows, we report the average effort of the optimal $N$ (which
would not be available in practice) and the minimum $N$, $N_B$, when using
Algorithm 1 without any of the improvements suggested in Sections 3 and 4.
These were computed by resampling from 10^ pre-simulated replicates
of the tuple (stopping-time, outcome), for each distribution, from which we
emulated the operation of the algorithm. (Finding the optimal $N$ would
otherwise have taken too much time.)

In the third and fourth rows we report the average effort of the algorithm
with the proposed improvements. The third row illustrates the improvements
of Section 3, which concerns the choice of $N$, setting
$\gamma_P = 0.1\gamma$. In the fourth row we additionally implemented the test
on joint information, described in Section 4, with $\gamma_J = 0.1\gamma$. In
both these rows each value represents the average effort observed from
actually running the algorithm 100 times. Each run used its own pilot sample
consisting of 1000 streams forced to terminate after 1000 steps. The effort of
the pilot is included in the report of the average effort.

First consider the difference between the third and fourth rows of Table 1.
The testing procedure can reduce the effort substantially, namely by 24%,
54%, 44% in the first three cases, although in the last case the reduction is
not significant.
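These percentages can be reproduced from the "No test" and "With test" rows of Table 1 (average efforts in millions, in the order $\beta = 0.05, 0.7, 0.9, 0.99$). The variable names below are hypothetical.

```python
# Average efforts (in millions) from the third and fourth rows of Table 1.
no_test = [10.5, 3324, 568, 10.4]
with_test = [8.0, 1541, 317, 10.4]

# Relative reduction in effort due to the joint test, in percent.
reductions = [round(100 * (a - b) / a) for a, b in zip(no_test, with_test)]
```

The list `reductions` contains 24, 54, 44 and 0, matching the figures quoted in the text.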

For the Uniform and Beta distribution with power 0.99, the optimal $N$
and $N_B$ turn out to be equal. As a result, the reduction of the effort seen
in the third row over the first two rows is mostly due to the intersection
method described in Subsection 3.1, which has allowed a smaller choice of $N$,
$N_P$.

For the Beta distribution with power 70%, the effort for the minimal $N$,
in the second row, is over 2.5 times larger than for the optimal $N$, in the
first row. As a result, in this example it was crucial to estimate this
optimum, by the procedure described in Subsection 3.2. The difference between
the effort for the optimal $N$ (which would not be known in practice) and the
adaptively chosen $N_O$ is not significant (although in this example enough
simulations would show that the optimal $N$ still performed better). As
previously mentioned, introducing the testing procedure in this example
further reduces the effort by a considerable margin, as demonstrated in the
fourth row. It is of some comfort that the best improvements from the
methodology of Sections 3 and 4 were found in the computationally most
demanding scenario.

In the third row, for the Beta distribution with power 90%, adaptively
choosing $N$ actually increased the effort, although not substantially. The
average adaptively chosen $N$ is roughly 10000, whereas the value used in the
second row is 17055 (for this distribution it is also the optimal $N$). We
would expect to reduce the effort on this basis. However, this does not appear
to completely compensate for the effort of the pilot and the error in coverage
probability lost in computing the pilot-based CI. With the test, however, we
reduce the effort by 40% and improve on both efforts reported in the first two
rows for this distribution.

Overall, from these experiments it seems that our suggested improve- 
ments reduce the expected effort substantially, as is best summarized in the 
difference between the bottom row and either of the first two. 

For future reference, the default settings of our algorithm are those of the
bottom row, namely: $\epsilon = \Delta/200$, $\epsilon_t = \epsilon t/(1000 + t)$,
$\gamma_P = \gamma_J = 0.1\gamma$ and a pilot sample of 1000 streams terminated
at $t_{\max} = 1000$.



6 Adaptive CI Length 

The expected computation time of our method (assuming it is not run in
parallel) is the time it takes to perform one resampling step (which depends
on the problem at hand) times $\mathrm{E}[e]$, where $e$ is the effort defined
earlier. When one resampling step is computationally demanding, the expected
efforts listed in Table 1 may appear prohibitive. In this case, we recommend
relaxing the fixed requirements on $\Delta$, i.e. allowing $\Delta$ to depend
on the 'location' of the confidence interval. This can reduce the expected
effort of the algorithm substantially.

As a rule of thumb, the closer the power is to 0.5 the higher the expected
effort (compare for instance the efforts for $\beta = 0.05$ and $\beta = 0.7$
in Table 1): firstly because the p-value distribution tends to have more mass
around $\alpha$, meaning that each stream in the algorithm has a higher
expected running-time, and secondly because the length of the confidence
interval is largest when there are the same number of positive and negative
outcomes.
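The second effect can be checked numerically. The sketch below is only an illustration: it uses the Wilson score interval as a stand-in for a binomial confidence interval (it is not the interval constructed by the algorithm), and the function name `wilson_length`, the sample size and the 99% normal quantile are assumptions introduced here.

```python
import math

def wilson_length(k, n, z=2.576):
    """Length of the Wilson score interval for k positive outcomes out of n.

    z = 2.576 is the (approximate) two-sided 99% normal quantile.
    """
    p_hat = k / n
    half_width = (z / (1 + z * z / n)) * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return 2 * half_width

n = 1000  # hypothetical number of resolved streams
lengths = {k: wilson_length(k, n) for k in (50, 300, 500, 700, 950)}
for k, length in lengths.items():
    print(f"proportion of positives = {k / n:.2f}  interval length = {length:.4f}")
```

For a fixed number of outcomes the interval is longest at an even split and shrinks symmetrically towards either extreme, so more streams are needed near a power of 0.5 to reach the same length.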

On the other hand, we anticipate that if the power is indeed around 0.5
or for that matter anywhere in the interval [0.1, 0.9], say, the user will often
only require a small enough confidence interval to conclude that $\beta$ is not
close to $\alpha$ or 1. Indeed, a typical reason why one needs the power of a
test is to check that the probability of rejection under the null hypothesis is
close to $\alpha$ (which is typically small) or that under an alternative
hypothesis $\beta$ is close to 1.

Let $\mathcal{C} = \{\beta \in [0, 1]^2 : \beta_1 \le \beta_2\}$ denote the set
of all possible confidence intervals for $\beta$. We allow the analyst to
pre-specify a subset of $\mathcal{C}$, $\mathcal{A}$ say, such that if the
current confidence interval is an element of $\mathcal{A}$ the algorithm
terminates immediately.

It is reasonable to enforce that $\mathcal{A}$ satisfy the following three
properties:

(i) $\mathcal{A}$ is closed.

(ii) $\{(\beta, \beta)^\top : \beta \in [0, 1]\} \subseteq \mathcal{A}$ (all
empty confidence intervals are allowed).

(iii) $\forall \beta \in \mathcal{A} : \forall \alpha \in \mathcal{C} :
\beta_1 \le \alpha_1 \le \alpha_2 \le \beta_2 \Rightarrow \alpha \in \mathcal{A}$
(a subinterval of an allowed confidence interval is allowed).

The following result shows that specifying $\mathcal{A}$ is equivalent to
specifying the maximum CI length allowed as a function of the confidence
interval midpoint.

Lemma 3. Suppose that $\mathcal{A} \subseteq \mathcal{C}$ satisfies (i)-(iii).
Then there exists a function $\Delta : [0, 1] \to [0, 1]$ such that for all
$\beta \in \mathcal{C}$: $\beta \in \mathcal{A} \Leftrightarrow
\beta_2 - \beta_1 \le \Delta\left(\frac{\beta_1 + \beta_2}{2}\right)$.

All of the theory we have presented in Sections 2-4 can be incorporated
unaltered into an algorithm with adaptive $\Delta$, with the single exception
that finding $N_P$ requires a brute-force search: one must ensure that
$\Delta(M)$ will be met after $N - 2$ streams have stopped, for any possible CI
midpoint $M$ arising from all the possible outcomes of $N - 2$ streams.

The effort of our recommended method for fixed $\Delta$ is repeated from the
fourth row of Table 1 to the first row of Table 2. These results are equivalent
to a case where for all $M \in [0, 1]$, $\Delta_0(M) = 0.02$. In the next rows
of Table 2 we present the average effort of the algorithm for three other
functions of the midpoint, all of which are illustrated in Figure 4. Depending
on what is easiest to present, the rule is described through $\Delta$ or by the
equivalent $\mathcal{A}$.

              $\beta = 0.05$    $\beta = 0.7$     $\beta = 0.9$     $\beta = 0.99$
Function      Av.    (S.E.)     Av.    (S.E.)     Av.    (S.E.)     Av.    (S.E.)
$\Delta_0$    8.0    (0.19)     1541   (18)       317    (5.2)      10.4   (0.09)
$\Delta_1$    7.8    (0.20)     185    (3.2)      131    (2.3)      26.2   (0.77)
$\Delta_2$    8.4    (0.46)     17.1   (0.46)     9.0    (0.06)     5.5    (0.08)
$\Delta_3$    8.4    (0.46)     0.7    (<0.01)    0.6    (<0.01)    0.5    (<0.01)

Table 2: Average effort (in millions) for different functions of the CI midpoint.

[Figure 4: plot of $\Delta_0(M)$, $\Delta_1(M)$, $\Delta_2(M)$ and $\Delta_3(M)$
against the CI midpoint $M \in [0, 1]$.]

Figure 4: The four midpoint functions $\Delta_i$ used in Table 2.

1. $\Delta_1(M) = 0.02\sqrt{M(1 - M)}/\sqrt{0.05 \cdot 0.95}$. A function that
allows roughly the same number of streams to remain unresolved for any
$\beta$. Because the CI midpoint cannot be 0 or 1 exactly, the fact that
$\Delta(0) = \Delta(1) = 0$ is not problematic.

2. $\mathcal{A}_2$ is the largest set of confidence intervals that satisfies
(i)-(iii) and that satisfies $\forall \beta \in \mathcal{A}_2 :
\beta_2 - \beta_1 \le 0.1$ and $\forall \beta \in \mathcal{A}_2$ with
($\beta_1 < 0.05$ or $\beta_2 > 0.95$): $\beta_2 - \beta_1 \le 0.02$. That is,
a CI length of 0.02 is needed for high or low powers, but a CI length of 0.1 is
admissible otherwise.

3. $\mathcal{A}_3$ is the largest set of confidence intervals that satisfies
(i)-(iii) and that satisfies $\forall \beta \in \mathcal{A}_3$ with
$\beta_1 < 0.05$: $\beta_2 - \beta_1 \le 0.02$. A precise estimate is only
required if the confidence interval is at least partly to the left of $\alpha$
and any interval is admissible otherwise.
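The correspondence between a set of admissible intervals and a maximum length per midpoint (Lemma 3) can be sketched numerically. The code below is an illustration under assumptions, not the authors' implementation: it reads the stated constraints of rules 2 and 3 as the membership test itself (with non-strict bounds on the length), takes $\alpha = 0.05$, and recovers the maximum admissible length at a midpoint by bisection, which is justified by property (iii). All function names are hypothetical.

```python
import math

def delta_1(m):
    """Rule 1: closed-form maximum CI length at midpoint m."""
    return 0.02 * math.sqrt(m * (1 - m)) / math.sqrt(0.05 * 0.95)

def in_a2(b1, b2):
    """Rule 2: length 0.02 near the extremes, length 0.1 otherwise."""
    if b1 < 0.05 or b2 > 0.95:
        return b2 - b1 <= 0.02
    return b2 - b1 <= 0.1

def in_a3(b1, b2):
    """Rule 3: length 0.02 if partly left of 0.05, anything otherwise."""
    if b1 < 0.05:
        return b2 - b1 <= 0.02
    return True

def max_length(in_set, m, iters=60):
    """Largest admissible length of an interval centred at m.

    By property (iii), if a centred interval is admissible then so is every
    shorter centred interval, so bisection on the length is valid.
    """
    lo, hi = 0.0, 2 * min(m, 1 - m)  # the interval must stay inside [0, 1]
    if in_set(m - hi / 2, m + hi / 2):
        return hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if in_set(m - mid / 2, m + mid / 2):
            lo = mid
        else:
            hi = mid
    return lo

for m in (0.05, 0.5, 0.9):
    print(m, delta_1(m), max_length(in_a2, m), max_length(in_a3, m))
```

Under this reading all three rules give a maximum length of 0.02 at midpoint 0.05, while at midpoint 0.5 rule 2 allows 0.1 and rule 3 allows any interval that stays to the right of 0.05.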

For the Uniform distribution, since all rules have $\Delta(0.05) = 0.02$, we
would expect the effort to be comparable, as is observed. On the other hand,
we see a dramatic reduction of the effort in other columns where the rule has
allowed less precision. Overall, if we consider for example the effort for
$\Delta_2$, we hope that with this compromise the algorithm can be used in
practice for moderately complicated tests.

$\Delta/\sigma$   0.5                         1.0                         1.5                         2.0
Truth             ${}_{0.183}0.184_{0.185}$   ${}_{0.441}0.442_{0.443}$   ${}_{0.728}0.729_{0.730}$   ${}_{0.912}0.912_{0.913}$
Our method        ${}_{0.182}0.185_{0.192}$   ${}_{0.440}0.443_{0.450}$   ${}_{0.726}0.729_{0.736}$   ${}_{0.910}0.914_{0.920}$
Boos & Zhang      0.175 (0.006)               0.439 (0.008)               0.731 (0.007)               0.921 (0.005)

Table 3: Power of the permutation test for the difference of means.



7 Example: Permutation test 



Using exactly the example of Boos and Zhang (2000), we computed the power
of a permutation test on the difference of the means of two Gaussian samples,
with sizes $K = 4$ and $L = 8$, identical standard deviation $\sigma$ and
standardized differences of the means equal to 0.5, 1, 1.5, and 2. We used a
fixed $\Delta = 0.01$ and coverage probability 0.99. Our other parameters were
set to the defaults listed at the end of Section 5. The results are in Table 3.

In three of the four cases our confidence interval excludes the corresponding
estimate in Boos and Zhang (2000) (although not after adding or subtracting
two of their standard errors). Of course, our computational effort is
considerably larger, but our key contribution is in providing a mechanism
that guarantees the precision of the result.

In this simple example it is in fact possible to compute the p-value of each
dataset exactly by evaluating all 495 permutations. Because of this the power
can be estimated by standard methodology with a Binomial-based confidence
interval. In each case, a very accurate estimate of $\beta$ was obtained by
generating 10^ datasets and computing the p-value for each exactly. The
resulting estimates are presented in the first row of Table 3, using the
convention ${}_a x_b$ to mean that the estimate is $x$ and the confidence
interval is $[a, b]$. In the second row we present the results of our
algorithm, using a fixed $\Delta = 0.01$ and coverage probability 0.99. In all
cases, the 'true' power falls within our estimated confidence interval, as
would be expected. For the convenience of the reader, the third row presents
the estimated powers and standard errors computed in Boos and Zhang (2000).
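The exact benchmark described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the computation used for Table 3: the test is taken to be one-sided with statistic mean(x) - mean(y), the first sample is the one of size 4, the samples have unit standard deviation, and the interval is a simple normal-approximation binomial interval. The names `exact_perm_pvalue` and `estimate_power` are hypothetical.

```python
import itertools
import math
import random

def exact_perm_pvalue(x, y):
    """One-sided exact permutation p-value for the statistic mean(x) - mean(y).

    Enumerates every way of choosing len(x) of the pooled observations,
    i.e. C(12, 4) = 495 relabellings for samples of sizes 4 and 8.
    """
    pooled = list(x) + list(y)
    k, n = len(x), len(pooled)
    total = sum(pooled)

    def stat(sum_first):
        return sum_first / k - (total - sum_first) / (n - k)

    observed = stat(sum(x))
    count = 0
    n_perm = 0
    for idx in itertools.combinations(range(n), k):
        n_perm += 1
        # small tolerance so that ties are counted despite rounding
        if stat(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            count += 1
    return count / n_perm

def estimate_power(effect, n_datasets, alpha=0.05, k=4, l=8, seed=1):
    """Monte Carlo power estimate with a 99% normal-approximation interval."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_datasets):
        x = [rng.gauss(effect, 1.0) for _ in range(k)]
        y = [rng.gauss(0.0, 1.0) for _ in range(l)]
        if exact_perm_pvalue(x, y) <= alpha:
            rejections += 1
    p_hat = rejections / n_datasets
    half = 2.576 * math.sqrt(p_hat * (1 - p_hat) / n_datasets)
    return p_hat, max(0.0, p_hat - half), min(1.0, p_hat + half)

print(estimate_power(effect=1.0, n_datasets=200))
```

With only 200 datasets the interval is far wider than those in Table 3; the point is that each p-value is exact, so the only uncertainty left is binomial.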






8 Conclusions 



We have proposed an open-ended algorithm that computes a conservative
confidence interval for $\beta$ without (almost) any assumptions on the
distribution of the p-value, see Theorem 1. In practice, the method can be
computationally expensive. However, a set of improvements described in
Sections 3 and 4 reduce the computational effort for fixed $\Delta$ by a
sizeable margin. By use of an adaptive $\Delta$, it can also be ensured that
the effort is only high if the power is in a region of interest, where a high
precision is required.

There remain areas of potential improvement: for instance the balance
between the error spent on $\epsilon$, the pilot and the testing procedure
could be explored in more depth, as well as the choice of the spending
sequences (e.g. $\epsilon_t$). The test for stopping based on joint
information in Section 4 is somewhat ad hoc, and conceivably a more powerful
test could be derived. Finally, of course, the computational effort could also
potentially be reduced by making additional assumptions on the p-value
distribution.

How conservative is the confidence interval? From a few simple experiments,
we have found the length to be roughly twice as large as it needs to be
for the nominal coverage probability. Although we have been conservative in
many aspects of the algorithm, this disparity appears to be almost entirely
due to the contribution from unresolved streams in (3). This is effectively the
price of making almost no assumptions on the distribution of the p-values.

A Finite expected stopping time 

The proof of Theorem 1 requires the following preliminary lemmas.

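Lemma 4. If there exist constants $\lambda > 0$, $q > 0$ and $T \in \mathbb{N}$
such that $\epsilon_t - \epsilon_{t-1} \ge \lambda t^{-q}$ for all $t \ge T$,
then

\[
U_t \le t\alpha + \sqrt{t(q\log t - \log\lambda)/2}, \qquad
L_t \ge t\alpha - \sqrt{t(q\log t - \log\lambda)/2}, \qquad t \ge T.
\]

Proof of Lemma 4. Let $t \ge T$ and let
$U_t^* = t\alpha + \sqrt{t(q\log t - \log\lambda)/2}$. The expression inside
the square-root is non-negative since
$1 > \epsilon_t - \epsilon_{t-1} \ge \lambda t^{-q}$. By Hoeffding's inequality
(Hoeffding, 1963),

\begin{align*}
P_\alpha(\tau \ge t, S_t \ge U_t^*) &\le P_\alpha(S_t \ge U_t^*)
  = P_\alpha(S_t/t - \alpha \ge U_t^*/t - \alpha) \\
&\le \exp\{-2t(U_t^*/t - \alpha)^2\} \le \lambda t^{-q}
  \le \epsilon_t - \epsilon_{t-1}.
\end{align*}

Furthermore, by the recursive definition of $U_t$ and $L_t$ in (2),

\[
P_\alpha(\tau < t, S_\tau \ge U_\tau)
  = P_\alpha(\tau \ge t-1, S_{t-1} \ge U_{t-1})
  + P_\alpha(\tau < t-1, S_\tau \ge U_\tau) \le \epsilon_{t-1}.
\]

Hence,

\[
P_\alpha(\tau \ge t, S_t \ge U_t^*) + P_\alpha(\tau < t, S_\tau \ge U_\tau)
  \le \epsilon_t.
\]

Thus, by (2), $U_t \le U_t^*$. The lower bound for $L_t$ can be found similarly.

□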
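For completeness, the step from the Hoeffding bound to $\lambda t^{-q}$ in the proof above can be written out. Taking $U_t^* = t\alpha + \sqrt{t(q\log t - \log\lambda)/2}$, so that $U_t^*/t - \alpha = \sqrt{(q\log t - \log\lambda)/(2t)}$,

\[
2t\left(\frac{U_t^*}{t} - \alpha\right)^2
  = 2t \cdot \frac{q\log t - \log\lambda}{2t}
  = q\log t - \log\lambda,
\qquad\text{so}\qquad
\exp\left\{-2t\left(\frac{U_t^*}{t} - \alpha\right)^2\right\}
  = \exp\{\log\lambda - q\log t\} = \lambda t^{-q}.
\]

Any larger choice of $U_t^*$ (for instance after rounding up to an integer) only makes the exponential smaller, so the inequality in the proof continues to hold.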



The above formally confirms the observation in Gandy (2009, main text
p. 1507 and Figure 4) that $U_t - L_t$ appears to be proportional to
$\sqrt{t\log t}$ for large $t$. Indeed, the spending sequence used,
$\epsilon_t = \epsilon t/(1000 + t)$, satisfies the conditions of the lemma
with $\lambda = 1$ and $q = 2$ (if one chooses $\epsilon \le 1/4$).

Lemma 5. Suppose that in the neighbourhood of $\alpha$ the CDF of $p$ is Hölder
continuous with exponent $\xi$, that the conditions of Lemma 4 hold, and that
$\epsilon \le 1/4$. Then, for any $\eta \in (0, 1)$, there exist a constant
$\kappa$ and a time $\tilde{T}$ such that

\[
P(\tau > t) \le 2e^{-2t^{\eta}} + \kappa t^{\xi(\eta - 1)/2}, \qquad t \ge \tilde{T}.
\]

Hence,

\[
P(\tau > t) = o(t^{d}), \quad \text{for any } d > -\xi/2.
\]
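Proof of Lemma 5. Let $F$ be the CDF of $p$. Then, for any $t \in \mathbb{N}$,

\[
P(\tau > t) = I\{[0, p_t^-]\} + I\{(p_t^-, p_t^+)\} + I\{[p_t^+, 1]\},
\]

where $I\{A\} = \int_A P_p(\tau > t)\,dF(p)$, $P_p(\tau > t)$ is the survival
function of the stopping-time of a stream generated by a p-value $p$, and
$0 \le p_t^- < \alpha < p_t^+ \le 1$. When $0 \le p \le p_t^-$ and
$L_t/t - p_t^- > 0$,

\[
P_p(\tau > t) \le P_p(S_t > L_t) \le P_{p_t^-}(S_t > L_t)
  \le \exp\{-2t(L_t/t - p_t^-)^2\},
\]

using Hoeffding's inequality for the rightmost bound. It follows that if we
define

\[
p_t^- = \max\{L_t/t - t^{(\eta - 1)/2}, 0\}, \qquad t \in \mathbb{N},
\]

for some $\eta \in \mathbb{R}$, then

\[
P_p(\tau > t) \le \exp\{-2t^{\eta}\}, \qquad 0 \le p \le p_t^-,\ t \in \mathbb{N}.
\]

Do we have $0 \le p_t^- < \alpha$? The lower bound is obvious. The upper bound
also holds, since the proof of Theorem 2 in Gandy (2009) shows that if
$\epsilon \le 1/4$ then $L_t/t < \alpha$ for all $t \in \mathbb{N}$.

Similarly we can define $p_t^+ = \min\{U_t/t + t^{(\eta - 1)/2}, 1\}$,
$t \in \mathbb{N}$, guaranteeing that $\alpha < p_t^+ \le 1$. Then, for any
$\eta \in \mathbb{R}$,

\[
P_p(\tau > t) \le \exp(-2t^{\eta}), \qquad p_t^+ \le p \le 1,\ t \in \mathbb{N}.
\]

We therefore have,

\[
I\{[0, p_t^-]\} + I\{[p_t^+, 1]\} \le 2\exp(-2t^{\eta}). \tag{6}
\]

It remains for us to obtain a bound on $I\{(p_t^-, p_t^+)\}$. Using Theorem 1
of Gandy (2009), $U_t - \alpha t = o(t)$, $\alpha t - L_t = o(t)$. Thus, by
restricting $\eta < 1$, $p_t^+ \to \alpha$, $p_t^- \to \alpha$ and there exists
a time $T^*$ such that $F$ is Hölder continuous over $(p_t^-, p_t^+)$ for all
$t \ge T^*$. It follows that for some constant $h > 0$,

\[
I\{(p_t^-, p_t^+)\} \le \int_{p_t^-}^{p_t^+} dF(p) \le F(p_t^+) - F(p_t^-)
  \le h(p_t^+ - p_t^-)^{\xi}, \qquad t \ge T^*.
\]

Let $\tilde{T} = \max\{T, T^*, 2\}$, where $T$ is defined in Lemma 4. For
$t \ge \tilde{T}$,

\begin{align*}
I\{(p_t^-, p_t^+)\} &\le h(p_t^+ - p_t^-)^{\xi} \\
&\le h\left[2t^{(\eta-1)/2} + (U_t - L_t)/t\right]^{\xi} \\
&\le h\left[2t^{(\eta-1)/2}
  + 2\left[\sqrt{t(q\log t - \log\lambda)/2} + 1\right]/t\right]^{\xi} \\
&\le h\left[2t^{(\eta-1)/2}
  + 2\left[\sqrt{(q + a)/2}\,\sqrt{t\log t} + 1\right]/t\right]^{\xi} \\
&\le h\left[(2 + c)\,t^{(\eta-1)/2}\right]^{\xi} \quad (\text{requiring } \eta > 0),
\end{align*}

where $a = \max\{0, -\log\lambda/\log\tilde{T}\}$,
$b = 2(\sqrt{(q + a)/2} + 1)$ and
$c = b\,\sup_{t \ge \tilde{T}} \sqrt{\log t / t^{\eta}}$.

We needed $\tilde{T} \ge 2$ in the definition of $a$ and used it in the third
inequality ($1 < \sqrt{2\log 2}$). Using (6), the proof is complete after we
take $\kappa = h(2 + c)^{\xi}$. □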
Proof of Theorem 1. The $(N - k)$th order statistic has a survival function
(Embrechts et al., 1997)

\[
P(\tau_{(N-k)} > t) = \sum_{i=0}^{N-k-1} \binom{N}{i}
  P(\tau > t)^{N-i}\, P(\tau \le t)^{i} \le c_1 P(\tau > t)^{k+1},
\]

for $t \ge 0$ and some $c_1 > 0$. Therefore, using Lemma 5,

\[
\mathrm{E}[\tau_{(N-k)}] = \sum_{t=0}^{\infty} P(\tau_{(N-k)} > t)
  \le c_1 \sum_{t=0}^{\infty} P(\tau > t)^{k+1}
  \le c_2 \sum_{t=1}^{\infty} t^{(k+1)d},
\]

for all $d > -\xi/2$, with $c_2$ chosen based on $c_1$ and $d$. The summation
in the right-hand side is finite if the exponent of $t$ is strictly smaller
than $-1$. $\lfloor 2/\xi \rfloor$ is the smallest possibility for
$k \in \mathbb{N}$ such that there exists a $d > -\xi/2$ with
$(k + 1)d < -1$. □
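The value $\lfloor 2/\xi \rfloor$ in the last step can be checked directly. A $d > -\xi/2$ with $(k+1)d < -1$ exists exactly when the interval $(-\xi/2, -1/(k+1))$ is non-empty, that is

\[
-\frac{\xi}{2} < -\frac{1}{k+1}
\;\Longleftrightarrow\; k + 1 > \frac{2}{\xi}
\;\Longleftrightarrow\; k > \frac{2}{\xi} - 1,
\]

and the smallest integer $k$ with $k + 1 > 2/\xi$ is $k = \lfloor 2/\xi \rfloor$. For example, $\xi = 1$ gives $k = 2$, and $\xi = 1/2$ gives $k = 4$.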



B Hypothesis test 

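The proof of Theorem 2 first requires the following lemma.

Lemma 6. Suppose that $X_j^1$ and $X_j^2$ are two sequences of independent
Bernoulli variables with success probabilities $\pi_1$ and $\pi_2$,
respectively, where $0 < \pi_1 < \pi_2 < 1$, and put
$S_t^k = \sum_{j=1}^{t} X_j^k$ for $k = 1, 2$. Let $\{l_t : t \in \mathbb{N}\}$
and $\{u_t : t \in \mathbb{N}\}$ be two arbitrary integer sequences and define
the random variable

\[
\tau_k = \begin{cases}
\infty & \text{if } l_t < S_t^k < u_t \text{ for all } t \in \mathbb{N}, \\
\min\{j : S_j^k \le l_j \text{ or } S_j^k \ge u_j\} & \text{otherwise.}
\end{cases}
\]

Then, if $P[\tau_k > t] > 0$ for $k = 1, 2$,

\[
[S_t^1 \mid \tau_1 > t] \le_{st} [S_t^2 \mid \tau_2 > t].
\]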

Proof of Lemma 6. With the conditioning on $\tau_1, \tau_2 > t$ removed, it is
known that $S_t^1 \le_{st} S_t^2$ since the variables being compared are
Binomial, see e.g. Boland et al. (2002, Theorem 1 (iii)). In order to show that
the same holds with the condition $\tau_1, \tau_2 > t$, we use a stronger form
of stochastic ordering: for two discrete RVs $X$ and $Y$, $X$ is smaller than
$Y$ with respect to the likelihood ratio order, denoted $X \le_{lr} Y$, if

\[
\frac{f_X(x)}{f_Y(x)} \downarrow x \quad \text{on the support set of } Y, \tag{7}
\]

that is, the ratio is non-increasing in $x$, where $f_X$ and $f_Y$ are the
probability mass functions (PMFs) of $X$ and $Y$ (Keilson and Sumita, 1982,
p. 184). Further, following Keilson and Gerber (1971), a discrete RV $Z$ has a
log-concave distribution if

\[
f_Z(x)^2 \ge f_Z(x - 1) f_Z(x + 1), \qquad x \in \mathbb{N}.
\]

We have $[S_1^1 \mid \tau_1 > 1] = X_1^1 \le_{lr} X_1^2 = [S_1^2 \mid \tau_2 > 1]$
and $[S_1^1 \mid \tau_1 > 1]$, $[S_1^2 \mid \tau_2 > 1]$ have log-concave
distributions. Suppose the same holds true for $[S_t^1 \mid \tau_1 > t]$ and
$[S_t^2 \mid \tau_2 > t]$. Consider first $[S_{t+1}^1 \mid \tau_1 > t]$ and
$[S_{t+1}^2 \mid \tau_2 > t]$. Since, for $k = 1, 2$,
$[S_{t+1}^k \mid \tau_k > t] = [S_t^k \mid \tau_k > t] + X_{t+1}^k$,
$[S_{t+1}^k \mid \tau_k > t]$ is a convolution of two random variables with
log-concave distributions, implying that it has itself a log-concave
distribution (Keilson and Sumita, 1982, Lemma p. 387). Using (Keilson and
Gerber, 1971, Theorem 2.1d),

\begin{align*}
[S_{t+1}^1 \mid \tau_1 > t] &= [S_t^1 \mid \tau_1 > t] + X_{t+1}^1
  \le_{lr} [S_t^2 \mid \tau_2 > t] + X_{t+1}^1 \\
&\le_{lr} [S_t^2 \mid \tau_2 > t] + X_{t+1}^2 = [S_{t+1}^2 \mid \tau_2 > t],
\end{align*}

which follows after verifying that

1. $[S_t^1 \mid \tau_1 > t] \le_{lr} [S_t^2 \mid \tau_2 > t]$ (by the induction
hypothesis).

2. $X_{t+1}^1$ has a log-concave distribution and is independent of
$[S_t^1 \mid \tau_1 > t]$ and $[S_t^2 \mid \tau_2 > t]$.

3. $X_{t+1}^1 \le_{lr} X_{t+1}^2$.

4. $[S_t^2 \mid \tau_2 > t]$ has a log-concave distribution and is independent
of $X_{t+1}^1$ and $X_{t+1}^2$.

Let $\tilde{f}_{t+1}^1$, $\tilde{f}_{t+1}^2$ denote respectively the PMFs of
$[S_{t+1}^1 \mid \tau_1 > t]$ and $[S_{t+1}^2 \mid \tau_2 > t]$, and
$f_{t+1}^1$, $f_{t+1}^2$ the PMFs of $[S_{t+1}^1 \mid \tau_1 > t + 1]$ and
$[S_{t+1}^2 \mid \tau_2 > t + 1]$. Then for $k = 1, 2$

\[
f_{t+1}^k(x) = \begin{cases}
0 & x \le l_{t+1} \text{ or } x \ge u_{t+1}, \\
\tilde{f}_{t+1}^k(x)/c_k & \text{otherwise,}
\end{cases}
\]

where $c_k = 1 - P(S_{t+1}^k \ge u_{t+1} \cup S_{t+1}^k \le l_{t+1} \mid \tau_k > t)$
is the normalising constant. From this it can be seen that if
$\tilde{f}_{t+1}^1$, $\tilde{f}_{t+1}^2$ are log-concave and satisfy the
likelihood ratio order, defined in (7), the same holds true for $f_{t+1}^1$,
$f_{t+1}^2$. By induction we deduce that for all $t$,
$[S_{t+1}^1 \mid \tau_1 > t + 1] \le_{lr} [S_{t+1}^2 \mid \tau_2 > t + 1]$,
implying the usual stochastic order (Boland et al., 2002, p. 558)
$[S_{t+1}^1 \mid \tau_1 > t + 1] \le_{st} [S_{t+1}^2 \mid \tau_2 > t + 1]$. □
Proof of Theorem 2. Let $n_t = |\mathcal{U}_t|$. $T^+$ can be bounded above by

nt 

T^<j2m"is?)<v] = f\ 

i=rt 

where $\{S_t^{(i)} : i = r_t, \ldots, n_t\}$ are the partial sums corresponding
to $p_{(r_t)} \le p_{(r_t + 1)} \le \cdots \le p_{(n_t)}$, the largest ordered
p-values.

Under $H_0$, $p_{(i)} \ge \alpha$ for $i = r_t, \ldots, n_t$. Let $S_t^{\alpha}$
be a partial sum generated by a p-value equal to $\alpha$ and let $\tau_\alpha$
denote its stopping-time. By Lemma 6, we have

where $\tilde{\tau}_{(i)}$ is the stopping time of $S_t^{(i)}$.
Therefore, conditional on $\tau_\alpha, \tilde{\tau}_{(i)} > t$,

where $X$ is a Bernoulli variable with success probability $\eta$. It follows
that

nt 
i=rt 

where $B^+$ is a Binomial variable with success probability $\eta$ and size
$n_t - r_t + 1$. Therefore, $T^+ \le \tilde{T}^+ \le_{st} B^+$.

The stochastic bound for $T^-$ can be proved by a similar procedure. □

C On the midpoint rule 

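Proof of Lemma 3. Let $t \in [0, 1]$ and define
$\Delta(t) = \sup\{\beta_2 - \beta_1 : (\beta_1 + \beta_2)/2 = t,\ \beta \in \mathcal{A}\}$.
This is well-defined because of (ii). The implication from left to
right follows by the definition of $\Delta$.

Let $\beta \in \mathcal{C}$ with
$\beta_2 - \beta_1 \le \Delta\left(\frac{\beta_1 + \beta_2}{2}\right)$. Let
$t = (\beta_1 + \beta_2)/2$. As $\mathcal{A}$ is compact and
$D = \{\zeta \in \mathbb{R}^2 : \zeta_1 + \zeta_2 = 2t\}$ is closed,
$\mathcal{A} \cap D$ is compact and thus
$\{\beta_2 - \beta_1 : (\beta_1 + \beta_2)/2 = t,\ \beta \in \mathcal{A}\}$ is
compact also.

Hence, there exists a $\gamma \in \mathcal{A}$ such that
$(\gamma_2 + \gamma_1)/2 = t$ and $\gamma_2 - \gamma_1 = \Delta(t)$.
This implies that $\beta \subseteq \gamma$ and hence, using (iii), that
$\beta \in \mathcal{A}$. □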